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Abstract 

In  this  memorandum  we  use  a  noise-robust  vector  quantiser  model  to  derive  ex¬ 
pressions  for  the  asymptotic  rode  vector  density  p  in  various  types  of  topographic 
vector  quantisers.  A  topographic  vector  quantiser  is  not  identical  to  a  standard  (ie 
Kohonen)  topographic  mapping,  but  the  differences  are  minimal.  In  all  the  cases 

that  we  study  (scalar  and  vector  quantisation  with  various  symmetric  topographic 

s 

neighbourhoods)  we  obtain  the  asymptotic  result  p  oc  ,  where  A’  is  the  input 

dimensionality  and  P  is  the  input  probability  density.  Thus  the  asymptotic  code 
vector  densities  of  a  topographic  vector  quantiser  and  a  standard  vector  quantiser 
are  the  same. 
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1  Introduction 


There  is  much  literature  on  vector  quantisation  (VQ)  and  scalar  quantisation  theory  [l,  2,  3j, 
where  an  encoding/decoding  scheme  is  optimised  in  such  a  way  as  to  minimise  a  distortion 
measure  (or  Lyapunov  function). 

There  is  also  an  interesting  class  of  transformations  (called  topographic  mappings  (TM) 
in  the  neural  network  literature  [4])  that  can  be  trained  to  perform  mappings  of  high 
dimensional  input  vectors  into  low  dimensional  output  vectors.  In  recent  work  [5,  6]  we 
showed  how  to  reformulate  the  problem  of  training  a  TM  as  a  problem  of  minimising  a 
distortion  measure:  this  required  a  slight  modification  of  the  original  training  algorithm, 
but  the  side  effects  of  this  were  minimal.  We  call  this  type  of  mapping  a  topographic 
vector  quantiser  (TVQ)  in  order  to  emphasise  its  close  relationship  to  a  plain  VQ,  and  to 
distinguish  it  from  the  standard  TM  method.  A  TVQ  has  the  important  property  that 
the  codes  that  they  produce  are  robust  with  respect  to  the  damaging  effects  of  code  noise 
corresponding  to  the  topographic  neighbourhood  used  during  training  (ie  minimisation  of 
the  distortion  measure). 

The  question  of  the  asymptotic  properties  of  TVQs  (versus  fhose  of  VQs)  naturally 
arises.  In  a  recent  study  [7]  the  asymptotic  code  vector  (CV)  density  in  a  scalar  TM  (1 
dimension  mapped  to  1  dimension)  was  derived,  and  found  to  be  proportional  to  P(x)a 
where  a  =  (2n  -  l)2, 3  ((n  -  l)2  -  n2),  P(x)  is  the  probability  density  of  input  scalars,  and 
n  is  the  half- width  of  the  update  neighbourhood  used  when  training  the  TM.  The  n  =  0  case 
reduces  to  a  -  1/3.  which  is  consistent  with  the  result  expected  from  a  scalar  quantiser. 

We  wish  to  derive  the  corresponding  TVQ  result  for  the  case  where  we  solve  a  minimum 
Li  distortion  problem  (as  formulated  in  [5,  6j),  rather  than  the  standard  TM  problem  (as 
formulated  in  A  ). 

In  §2  we  shall  present  the  TVQ  (as  opposed  to  the  TM)  version  of  the  derivation  that 
appeared  in  [7],  For  convenience,  we  present  a  summary  of  [7j  in  appendix  A  2.  In  §3 
we  shall  extend  this  result  to  the  more  general  case  of  a  symmetric  (and  monotonically 
decreasing  to  zero)  neighbourhood.  In  §4  we  extend  these  results  to  the  full  vectorial  case 
by  introducing  a  simple  VQ  model  that  can  be  used  to  derive  the  asymptotic  CV  density, 
and  in  §5  we  shall  extend  this  model  to  the  TVQ  case.  In  all  the  cases  that  we  study 
we  obtain  the  same  result  for  a  (in  the  TVQ  case)  that  we  would  have  obtained  in  the 
corresponding  VQ  case.  In  general  a  =  N/(N  +  2),  where  N  is  the  dimensionality  of  the 
input  vector. 


2  Density  in  1  dimension 


In  this  section  we  shall  derive  the  asymptotic  CV  density  for  a  topographic  scalar  quantiser, 
which  maps  a  1  dimensional  input  scalar  to  one  of  a  set  of  CVs  (or,  strictly  speaking,  code 
scalars).  We  shall  formulate  an  Li  (or  Euclidean)  distortion  to  take  account  of  the  same 


1  We  also  correct  a  number  of  serious  typographical  errors  that  appeared  in  [7]. 
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update  neighbourhood  that  was  used  in  [7]  in  order  that  we  may  obtain  comparable  results. 
For  completeness,  in  appendix  A  we  summarise  the  derivation  in  [7]. 

2.1  Euclidean  distortion 
Introduce  an  L 2  distortion  measure  D\  as 


•  -2  •  - 1  •  0  •  1  *2  y 


9-3  9-2  9-i  9o  9i  92 


Figure  1:  Encoding  and  decoding 

D'  =  JdzP(x)  -<(,)) 2 

~  'll  f  dxP{z)  (*  -  *y)  U) 

V  ■'«»-! 

In  figure  1  we  show  as  a  network  the  various  steps  that  are  involved  in  calculating  D\t  the 
figure  reads  from  the  bottom  to  the  top.  The  input  z  is  a  scalar  which  we  represent  by  the 
horizontal  axis  at  the  bottom  of  figure  1.  y(z)  is  an  encoding  function  that  maps  from  the 
input  z  to  an  index  y.  The  interval  igy-i,9y]  is  defined  as  follows 

=  i* :  y  =  y(*)}  (2) 

which  we  use  in  equation  (1)  to  partition  the  range  of  integration  into  a  set  of  convenient 
intervals.  The  output  x'y  is  the  CV  (or  decoding  function)  associated  with  code  index  y, 
and  it  sits  on  the  horizontal  axis  xy  that  we  have  drawn  at  the  top  of  figure  1.  The  set 
of  x'y  (ranging  over  all  vales  of  the  index  y)  comprises  the  codebook  that  is  used  in  this 
encoding/decoding  operation.  The  overall  goal  is  to  choose  the  encoding  y(z)  and  decoding 
x'y  functions  in  such  a  way  as  to  minimise  the  mean  L ■>  distortion  D\  between  input  z  and 

output  x'y{z) 
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2.2  Minimum  Euclidean  distortion 

In  order  to  minimise  D\  we  must  differentiate  it  with  respect  to  the  various  free  parameters: 
in  this  case  the  qy  (which  parameterise  the  encoding  function)  and  the  x'y  (which  parame- 
terise  the  decoding  function  y{x),  see  equation  (2)).  Thus  we  obtain  the  partial  derivatives 
as 

=  p{q v)  (9»-*;+i)* 

=  2P(qy)  (x'y+1  -  <)  (qy-*'v  +  *'v^  (3) 

=  ~2J'  dxP{x)(x  -  x’y) 

whence  the  stationary  points  of  D i  must  satisfy 

%  =  ^(4  ’  Vi) 

'  _  m^drp(x]z 

x"  “  m^xP(x) 

Note  that  equation  (oj  requires  that  qy  lies  midway  between  the  adjacent  CYs,  so  it  defines 
a  nearest  neighbour  encoding  function  y(x). 


2.3  Robust  Euclidean  distortion 

Now  we  shall  generalise  the  Z2  distortion  measure  of  equation  (1)  to  include  a  neighbourhood 
function  ~y\y  that  specifies  the  extent  to  which  y'  is  m  the  neighbourhood  of  y  iiaiei  on  we 
shall  define  this  notion  more  precisely).  Thus  introduce  the  Li  distortion  measure  Di  as 

Di  =  fdxP{x)  ]£v.v(«)  (*  -  zv') 

v' 

=  Y,[  ’  dxP{x)  (*  -  *y)J  (7) 

V  •'«»->  y' 

The  y(x)  (and  hence  the  q„)  and  the  x'yl  used  in  Di  are  to  be  understood  to  be  different 
from  those  used  in  D\.  In  figure  2  we  show  how  a  modified  version  of  figure  1  in  which  the 
effect  of  the  is  represented  (for  simplicity,  we  show  only  *V±i,„).  The  action  of  the 
is  interposed  between  the  action  of  y(r)  and  the  action  of  x'yl,  and  it  can  be  interpreted 
as  the  relative  probability  with  which  index  y  is  corrupted  by  some  distortion  process  to 
become  index  y1.  With  this  interpretation  in  mind,  it  is  evident  that  minimising  Di  with 
respect  to  the  choice  of  y(x)  and  x'y  will  cause  the  cnccding/decoding  process  to  Become 
robust  with  respect  to  the  damaging  effects  of  the  distortion  process  modelled  by  xyv  [5,  6], 


(5) 

(6) 
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9-3  9-2  9-i 
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Figure  2:  Encoding  and  decoding  in  the  presence  of  code  distortion 


2.4  Minimum  robust  Euclidean  distortion 


We  may  no*  repeat  the  derivation  of  §§2.2  for  D  2  (rather  than  D\).  The  partial  derivatives 


j  i)  *»  r — »  t  \  2  r — *  f 

=  P'-  9s,  1  2-V.v  -  Vi  "  2- V.y-1  («»  -  V 


dr  P|/)  (/  -  /' 


-  -2 Vf 
A. 


so  the  stationary  points  of  D 2  must  satisfy 


{(Ill  ry)  —  V.v+l  (^V  " 


**  = 


£y/v'-,  £ 

E,-/v.,  d/P(x)try  y 


In  equation  (10)  note  that  the  effect  of  *y,y  is  to  destroy  the  nearest  neighbour  encoding 
property  that  we  obtained  in  equation  (5). 


2.5  Finite  differences  and  derivatives 


In  this  subsection  we  collect  together  various  useful  results  that  we  need  in  order  to  derive 
the  asymptotic  CV  density.  In  order  to  make  direct  contact  with  the  results  that  were 
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reported  in  -7  we  shall  now  assume  a  specific  form  for  r 


1  if  W  ~  y\  <  n 

0  if  \y'  -  yi  >  n 


(12) 


This  7rv'  v  defines  a  uniform  neighbourhood  that  ranges  from  y  -  n  to  y  n  in  the  neigh 
bourhood  of  code  index  y ,  which  we  shall  call  a  [-n,+n]  neighbourhood. 


With  this  assumption  we  can  solve  equation  (10)  to  yield 


9y 


-  2  ZV+n*l) 


(13) 


which  should  be  compared  with  the  result  in  equation  (5)  (which  corresponds  to  making  tin- 
choice  y  =  <*!,', y.  or  n  =  0,  in  equation  (13)).  The  effect  of  the  [-n,  n )  neighbourhood 
is  to  replace  the  midpoint  of  the  interval  j]  by  the  midpoint  of  the  larger  interval 


-  2 


Now  introduce  a  pair  of  ex-pressions  to  relate  the  finite  differences  of  to  the  dt  ru  atives 
dr'y  dy  and  d7  r'  dy7  of  i'y 


'  v  -  * 


•V* 


o 

**  y 


dr'  (  d3r' 

2k~d y  ~  °  (*  ~dy* 


k‘ 


dy2 


i  14 


These  two  expressions  can  easily  lie  obtained  by  Taylor  expanding  (about  the  point  where 
the  code  index  h«v  value  y<  the  various  terms  on  the  left  hand  sides  of  equation  1 14  u 

V sine  equation  (  1 3  /  and  equat ion  (14/  we  may  derive  the  midpoint  uy  and  the  half-length 
o.  the  mterwi,  y  y  n  t .  9y  * as 


.>  ( 9y  •  »i  -  1  *  9y  -  n  ) 

If'  ' 

^  \^y-2n-l  ^y-2n*3 


oink'd 


dy 7 


dy 4 


(lo 


—  2  n  ~  9 y-n-  1  ) 

1  /  ,  ,  1 
7  V^y^Jn+l  Zy-2n-lJ 


2n  +  1  dz‘ 

—  ff  +  °[n 


dy 3 


(16) 


We  now  wish  to  transform  our  results  into  the  language  of  density  of  CYs  p(ry).  We 
can  relate  p(ry)  to  quantities  that  we  have  already  introduced,  as  follows 


,  >  \  dy 


(1- 


’Note  that  z’f  <.  <j,  <  i,.,, 


is  not  necessarily  true  when  n  >  0. 
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Thus  p{x'y)  is  the  number  of  CV  indices  y  per  unit  change  in  CV  position  x'y.  Ln  equation  (14) 
we  encountered  derivatives  of  x '  (up  to  the  fourth  order,  if  we  include  the  next  to  leading 
terms'  which  we  shall  now  express  in  terms  of  derivatives  of  p(x'y).  Thus 


dy  p{x'v) 

d2x'v  =  1  dp(x’y) 

dy2  P(r't)3  dy 

d3x'y  1  d2p{x'y)  3  /  dp(a-^)\2 

dy3  P[X’V)A  dy 2  p{z'y)1  \  dy  ) 

d^^y  _  1  d3p{x'll)  10  <P p(x'v)  dpjx'y) _ 15  (dp{x'y)V 

dy 4  p(x'y  )5  dy 3  +  p(x'y)6  dy2  dy  p{z'y)7  \  dy  ) 


where  we  have  used  d  dy  =  (dx'y  dy)d  dx'y  =  p(x'y)~1  d/dx'y.  to  perform  the  differentiation. 
We  may  thus  express  the  results  for  uy  (equation  (15))  and  ay  (equation  (16))  in  terms  of 
derivatives  of  p\x'y)  as  follows 


(2n  -  l)2dp(x;i  ^  /  n<  /  dp(x'y)  \  3  \ 
4pi*'y  i3  dy  \  dy  )  j 


(19) 

(20) 


2.6  Approximate  optimal  code  vector  positions 


We  now  have  all  the  basic  theoretical  results  that  are  needed  to  perform  a  Taylor  expansion 
of  equation  (11  )  to  relate  the  derivative  of  P(x  )  to  the  derivative  of  p{x  ).  Insert  the  7tyy  (de¬ 
fined  in  equation  (12}),  and  use  the  definitions  of  the  midpoint  Uy  and  half-length  ay  of  the 
interval  qy.n- (in  equation  (15)  and  equation  (16).  respectively),  in  equation  (11) 
to  obtain 


a;  *  i 

Jp(u 

(uv  +  r) 

’  dx  | 

) 

2a  u  P(u  )  w.  2a> 

!  +  O  I 

Sai 

\f  du\  ) 

2 ayP(uy)  +  0  | 

1 

dP\)  (  ^P(uy)\ 

3P(uv)  duy  \P(uv)  du2  J 

al  dP(x'y)  ^  /  uva\  d2P{x’y) 

3  P{x'y)  dx'y  \p{*'v)  dx'2 


(21) 


In  the  last  stage  of  this  derivation  note  that  the  effect  of  the  change  uy  — •  x'y  appears  only 
in  the  next  to  leading  order  terms. 
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2.7  Code  vector  density 

Finally,  inserting  the  expressions  for  uy  and  ay  (from  equation  (19)  and  equation  (20i)  into 
equation  (21),  we  obtain 


1  dp{x'y)  _ 
p(i')  dy  3  P(x')  dy 


1  +  h.o.. 


(22) 


where  h.o.t.  denotes  higher  order  terms.  We  may  solve  this  to  yield  in  leading  order 

p(x')  oc  P(x'  Y'3  (23) 


This  power  law  dependence  is  the  same  as  that  observed  in  the  equivalent  scalar  quantiser 
(which  corresponds  to  a  neighbourhood  function  rry\v  =  £y',v),  but  is  different  from  the 
result  that  was  obtained  in  [7]  for  a  TM  as  defined  in  [4], 


3  Density  in  1  dimension:  symmetric  neighbourhood  func¬ 
tions 


In  this  section  we  shall  extend  the  results  of  §2  to  the  case  where  ay _y  defines  a  symmetric 
(monotonically  decreasing  to  zero)  neighbourhood  function  surrounding  each  code  index. 
Furthermore,  we  shall  restrict  our  attention  to  symmetric  neighbourhoods. 


3.1  Robust  Euclidean  distortion 

We  shall  now  define  the  distortion  matrix  7ry .  y  (used  in  the  definition  of  the  Ij  distortion 
D i  in  equation  (7))  in  such  a  way  that  it  satisfies 

’V-i-i.y-n  =  *y.v  (Toeplitz  matrix)  . 

7Ty'  y  =  rry  v'  (symmetric  matrix) 

For  such  matrices  it  is  sufficient  to  specify  the  form  of  a  single  row  (or  column)  of  *yv  as  a 
symmetric  function  of  y'  -  y.  This  type  of  matrix  specifies  a  distortion  that  treats  each  code 
index  y  on  an  equal  footing  (the  Toeplitz  property),  and  it  implies  a  symmetric  topographic 
neighbourhood  (the  symmetric  property). 

For  convenience,  and  to  make  contact  with  the  derivation  presented  in  §2,  we  decompose 
*yiV  as  a  weighted  sum  over  (symmetric)  neighbourhood  functions  of  the  type  defined  in 
equation  (12) 

V,v  =  ^  (25) 

where  we  constrain  h,  >  0  to  ensure  that  flytV  is  a  monotonically  decreasing  function  of 
y'  -  y.  Note  that  this  monotonicity  constraint  is  in  addition  to  the  properties  that  we 
specified  in  equation  (24):  we  impose  it  to  cure  some  stability  problems  that  can  arise  when 
training  TVQs.  In  figure  3  we  give  an  example  of  the  type  of  composite  neighbourhood 
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- 1 — ^ —  I  t  l — ^ ^ ^ 

-4  -3  -2  -1  0  1  2  3  4  y'  -  y 

(b)  Decomposed  neighbourhood  function 


Figure  3:  Symmetric  composite  neighbourhood  function 


function  that  is  described  by  this  model.  We  show  in  figure  3(a)  a  typical  7ry'  y  neighbour¬ 
hood  function,  and  we  show  in  figure  3(b)  its  decomposition  as  a  sum  over  neighbourhood 
functions  spanning  the  intervals  (-n,,  for  various  s. 

Using  the  definition  of  7ry'  y  in  equation  (25)  we  can  simplify  equation  (7)  to  become 
D 3  given  by 

^3  =  *  dxP(x)  Y,h‘  jl  (*"**+✓)  (26) 

V  »  y'=-n. 


3.2  Minimum  robust  Euclidean  distortion 


Now  differentiate  D3  to  obtain  dDj/diy  and  dDj/dgy.  The  stationary  points  of  D$  must 
then  satisfy 


z„  = 


1  (^y-fn.  +  l  ^V-n,)  (^y+n.  +  l  U 


2  e.a.  (*; 

£.  A.  dzp(x)x 

h.  dxP(r) 


V+n,+l  zy-n 


.) 


(27) 


(28) 
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Equation  (28)  replaces  equation  (11)  and  equation  (27)  replaces  equation  (13). 

Note  that  the  positivity  of  the  h,  in  equation  (25)  guarantees  the  stability  of  the  solution 
for  the  x'y  and  the  qy  in  equation  (28)  and  equation  (27),  because  the  denominators  sire 
strictly  positive  3. 

Unfortunately,  the  expression  for  qv  in  equation  (27)  is  sufficiently  complicated  that  we 
have  to  perform  a  large  amount  of  algebra  to  derive  the  asymptotic  relationship  between 
P{x)  and  p(x)  (ie  the  generalisation  of  equation  (23)). 


3.3  Finite  differences  and  derivatives 


Firstly,  express  x'y+k  ±  x'y+l  in  terms  of  dx'yfdy  and  d?x'y/dy2. 


-2*0 

+ 

I  (*v+*  -  X'y- 

1  f 

XV~k  —  Xy~l 

= 

x  Hx'v+l  +  x'v-l  - 

2x») 

± 

\  (ZV+t  -  X'v-I 

4-  z'  x' 

V  —  'Ly 

k2  ±  l2  \  d2x'y 
2  )  dy* 


(t±''$  +  (<±r0 


(29) 


where  we  have  used  the  finite  difference  expressions  in  equation  (14).  Note  that  we  use  ± 
signs  consistently  throughout  equation  (29),  such  that  if  one  were  to  choose  the  upper  sign 
in  one  part  of  the  equation  then  one  must  choose  the  upper  sign  throughout  the  rest  of  the 
equation  (a  similar  remark  applies  to  the  lower  sign).  We  may  use  these  results  to  simplify 
9y..nj  and  qi/~nt-i  to  obtain 


1  T.t  ht  (6(0  +  6(3,0)  (bo  +  vi(j)  + 

2  E,M6(0  +  6(M)) 

1  Et  ht  (6(0  -  6(3,0)  (bo  -  vi{»)  +  bz(j.Q) 

2  £,M6(0- 6(3,0) 


(30) 


where  we  have  defined  £i(<),  £2(3,*),  t?o,  V\(»)  and  772(3,0  as 


dx' 

6(0  =  (2n«  +  1)  •— 

dy 

1  cPx' 

6(3,0  =  ^  [(”•  +n‘  +  “  "I)2] 

Vo  =  2x'y 

dx' 

f7i(s)  =  (2n,  +  1)  —~ 

dy 

Vi(»,t)  =  1  [(n,  +  n,  +  l)2  +  (n,  -  n,)Jj  (31) 


*We  assume  that  the  input  probability  density  P(x )  is  well-behaved,  in  the  sense  that  each  code  index  y 
is  indeed  associated  with  a  finite  probability  mass. 
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Note  that  (\(t)  and  rj\(s)  are  trivially  related  to  each  other. 

We  may  now  introduce  the  midpoint  u'  and  half-length  a'  of  the  interval  [9y-n<-i,  qy+n,\, 
and  use  equation  (30)  to  simplify  their  expressions.  For  compactness,  we  gather  these  two 
results  into  a  single  vector  equation  (where  it'  is  the  upper  element  and  a'  is  the  lower 
element  in  a  two-component  column  vector) 


1  I  ly-rn.  +  9y-n,-l 

2  \  ?y+n,  —  9y-nj-l 


6(O6(M)-6(0  6(M')  \ 
+  ^l(  }  {  6(06(0 -6(M)6(M*)  ) 


\  Et,t>  hthf 


Zu'hthf  (Zi(t)ti(t')  ~  6(M)6(M')) 

We  have  made  use  of  the  fact  that  V,  t,  h,ht’  (6(06(s>  * )  — 6(06(J»  O)  —  0  (by  symmetry) 
to  simplfy  the  denominator.  Now  introduce  some  approximations  which  are  valid  in  the 
leading  order  of  the  expansion  in  terms  of  derivatives  of  x'v  with  respect  to  y. 

6(M)6K0  -  o 

V2 (M)  (6(O6(M)~6(06(M'))  -  0  (33) 

When  we  insert  these  approximations  into  equation  (32),  and  we  make  use  of  the  relation¬ 
ships  in  equation  (18),  we  obtain 

,  1  Em'  Mr  6(06 (O  (ho  h2(«,0) 

S  “  2  Em' Mr  6(060') 

,  lEi^t  (2n(  -  1)  |(n,  +  n(  +  l)!  +  (n,  -  n()2J 
*'  4  Ei  (2n,  -el)  dy2 


,  _  (2 n,  -  l)2  1  E<6  (2n <  -*•  l)(nt  -  n,)(n,  -  n,  +  1)  , 

~  *v  4  2  E.M2n,  +  l)  p(x;)3"57p  J 

,  1  Em'  M«,  6(06(0)  r?i(i) 

°v  "  2  E(.c  Mr 6(06(0 

-  (35) 

2p(*y)  ^  ’ 

We  have  expressed  these  results  in  such  a  way  that  they  may  readily  be  compared  with  the 
analogous  results  in  equation  (19)  and  equation  (20). 


rK34) 


When  comparing  equation  (34)  with  equation  (19)  note  that  there  is  a  leading  order 
correction  term  caused  by  the  presence  of  more  than  one  component  in  the  composite 
neighbourhood  function,  but  note  that  equation  (20)  needs  no  such  leading  order  correction 
to  become  equation  (35). 


3.4  Approximate  optimal  code  vector  positions 

We  shall  now  form  a  Taylor  expansion  of  the  integrands  in  the  numerator  and  denominator 
of  equation  (28).  The  steps  in  the  derivation  are  similar  to  those  used  to  derive  the  result 
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in  equation  (21)  so  we  shall  not  repeat  them.  The  final  result  is 

+  (36) 

(o‘)  3  (aj)P(*;)  dx'y 

where  our  angle  bracket  notation  represents  a  average  weighted  by  the  h,,  which  is  defined 

E.M---)  (37) 


as 


(...) 


Z.h. 


The  ratios  of  weighted  averages  in  equation  (36)  may  be  evaluated  by  taking  appropriate 
averages  of  the  results  in  equation  (34)  and  equation  (35),  to  yield 


(°X)  ,  <(2n.  +  l)3)  dp(z[ 

(1)  =  '  " 
alY 


y  4  (2n,  +  1)  p{x'y)3  dx'y 
((2  n.  +  l)3) 


4  (271,  -  l)p(x^)2 


(38) 


where  we  note  that  the  contribution  of  the  correction  term  in  equation  (34)  disappears  (by 
svnunetrv). 


3.5  Code  vector  density 

Finally,  inserting  the  results  of  equation  (38)  into  the  leading  order  Taylor  expansion  in 
equation  (36)  yields  (in  leading  order)  the  same  differential  equation  that  we  obtained  in 
equation  (22).  Thus  we  have  shown  that  for  the  class  of  corresponding  to  symmetric 
monotonically  decreasing  neighbourhood  functions,  and  retaining  only  leading  order  terms, 
the  CV  density  is  given  by  p(r^)  y.  P(xy)l/3  (ie  the  same  as  equation  (23)). 


4  Density  in  N  dimensions:  standard  vector  quantiser 


In  this  section  we  shall  present  a  simple  derivation  of  the  CV  density  for  a  standard  VQ. 
We  believe  this  to  be  a  novel  way  of  deriving  this  result,  and  it  serves  to  underpin  the  TYQ 
case  that  we  discuss  in  §5. 

Unfortunately,  the  derivation  that  we  present  is  based  upon  a  qualitative  model  of  the 
distortions  that  occur  during  encoding/decoding,  so  the  results  that  we  obtain  remain  open 
to  some  criticism  V 

4.1  Vector  quantiser  model 

We  present  our  VQ  model  in  figure  4  where  we  use  a  fully  vectorised  form  of  the  notation  that 


4 We  would  welcome  comments  and  suggestions  on  how  one  might  improve  on  this  derivation. 
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input 

x 

*  -  y 


code 

y 


-o 


reconstruction 


Figure  4:  Standard  vector  quantiser 


we  used  in  earlier  sections  of  this  paper.  The  input  vector  *  is  encoded  (by  an  encoding 
function  y(*))  to  produce  an  index  y,  which  is  then  decoded  (by  a  decoding  function 
*'(y))  to  produce  a  reconstructed  vector  The  goal  is  to  choose  y(*)  and  x'(y)  so  as  to 
minimise  the  average  of  an  appropriately  chosen  distortion  measure  between  the  original  x 
and  reconstructed  x'  versions  of  the  input  vector. 


4.2  Euclidean  distortion 


Introduce  an  L 2  distortion  measure  D+  as 

D\  =  J  d*  P{*)  I  *  -  *'(y(*))  2  (39) 

where  P[x)  is  the  probability  density  over  possible  inputs,  and  the  notation  !!•••!,  denotes 
the  norm  of  the  enclosed  vector. 


In  preparation  for  our  reformulation  of  D+  in  terms  of  a  CV  density  we  shall  reexpress 
D4  as  an  integral  over  y  by  using  the  identity 


J  dy<c(y  -  y(x))  =  1 


(40) 


where  £(••■)  is  the  Dirac  delta  function.  Thus  we  may  rewrite  equation  (39)  as  follows 


Di 


J  dy  J  dx  P{x)S(  y  -  y(x))  \\x  -  x'(y(x))\\ 
J  dy  J  dxP{x)6(y  -  y(x))  \\x  -  *'(y)J 


*'(y)=(*>*,y 


J  dyP(y)  (ll*ll2)z|y  ~  2(e)a|y  .*'(y)  +  ||*'(y) 

/ dyP(y)  |(c)jB|y  -  *,(y)||J  +  ^1*  —  (*)*|y| 

/dyP(y)(|a:-<*)*ly||2} 


*iyj 


(41) 


where  (using  a  rather  careless  notation)  P(y)  =  J  dx  P{x)6(y  -  y{x))  is  the  probability 
density  over  possible  codes  8,  and  the  angle  brackets  (■••)a!|y  denote  an  average  over  the 

‘Strictly,  we  ihould  use  s  different  notation,  say  Pt( z)  and  F,(v).  for  the  two  probability  densities, 
because  they  are  different  functions  of  their  arguments.  However,  for  our  purposes,  it  is  always  possible  to 
deduce  tom  the  context  (ie  z  or  y)  which  probability  density  is  required. 
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posterior  probability  density  (of  *  given  y) — in  this  case  this  is  simply  an  average  over 
those  *  that  map  to  y  via  y(x). 

In  the  final  line  of  equation  (41)  we  replace  £>4  by  its  minimum  with  respect  to  x'(y): 
thus  we  shall  henceforth  set  *'(y)  =  (*)zjy.  We  may  interpret  equation  (41)  as  follows. 


1.  J  dyP(y)  is  an  average  over  the  possible  codes  y. 

2.  /|j* -(*)_,, ,||  )  breaks  down  thus: 

V1  ly"  / x,y 

(a)  {x)x,y  is  the  mean  of  the  set  of  x  that  map  to  y.  We  call  this  the  posterior 

mean,  because  the  average  uses  the  posterior  probability  P(x |y). 

(b)  x  -  (*)jp  is  the  error  vector  between  the  posterior  mean  and  the  true  input 
vector  x. 

(c)  Finally,  the  whole  of  this  expression  is  the  posterior  mean  of  the  norm  squared 
of  the  error  vector. 


Thus  equation  (41)  reduces  to  the  average  posterior  I2  distortion. 


4.3  Code  vector  density 

Let  us  rewrite  equation  (41)  by  introducing  a  CV  density  p(x)  which  describes  the  number 
of  CVs  x'(y(x))  per  unit  volume  in  the  input  space.  There  is  a  distortion  volume  in  input 
space  associated  with  CV  index  y  that  corresponds  to  the  set  of  *  that  satisfy  y  =  y(z)> 
and  the  characteristic  radius  L  of  this  volume  is  given  by 

i  =  p(*r*  (42) 

where  we  assume  that  the  input  space  is  A’-dimensional.  Note  that  we  have  assumed 
without  proof  that  a  single  radius  characterises  the  distortion  volume.  This  is  valid  because 
the  optimal  shape  for  distortion  volumes  is  spherical,  but  we  shall  not  let  ourselves  be 
sidetracked  by  such  details,  since  the  purpose  of  this  section  is  to  present  an  alternative 
strategy  for  deriving  the  CV  density  that  occurs  in  a  standard  VQ.  For  a  complete  derivation 
of  this  result  (not  assuming  isotropy)  see  appendix  B 

The  final  result  in  equation  (41 )  tells  us  that  each  CV  contributes  to  the  overall  distortion 
D 4  an  amount  which  is  the  product  of  its  probability  of  selection  times  a  posterior  mean 
squared  error.  Dimensional  analysis  tells  us  that  £>4  may  be  modelled  as 

£4  ~  J  dx  P{x)  p(*)“77  (43) 

Thus  we  have  replaced  the  rather  complicated  expression  in  equation  (41)  by  a  very  simple 
(approximately  equivalent)  expression  formulated  in  terms  of  the  CV  density. 
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4.4  Optimum  code  vector  density 


We  shall  now  minimise  D\  in  equation  (43)  subject  to  the  condition  that  the  total  number 
C  of  CVs  is  held  constant.  C  is  given  by 

C  =  J  dx  P(x)  p(x)  (44) 

so  we  must  minimise  the  composite  distortion  D\  given  by 

D\  =  D<  +  AC  (45) 

where  A  is  a  Lagrange  multiplier,  whose  value  will  be  chosen  later  to  guarantee  that  C  takes 
the  correct  value. 


Functionally  differentiating  D\  with  respect  to  p{x)  yields 

/j1:  =  -t :P(*)p(*r^  +  A  (46) 

tp(x)  A 

Thus  Sp(x)  =  0  when 

p(x)  a  P(z)'r+J  (47) 

This  is  the  required  CY  density,  which  correctly  reduces  to  p(x)  ex  P(x)lf3  in  the  1- 
dimensional  case,  as  expected.  This  result  may  easily  be  generalised  to  the  case  of  an  L, 

V 

distortion  measure  to  yield  p{x)  x.  P(x)  . 


5  Density  in  N  dimensions:  topographic  vector  quantiser 

In  this  section  we  shall  generalise  the  result  of  §4  to  take  account  of  a  non-zero  topographic 
neighbourhood,  thus  transforming  a  standard  VQ  into  a  TVQ. 

5.1  Topographic  vector  quantiser  model 

We  present  our  TVQ  model  in  figure  5  which  is  basically  the  same  as  figure  4  except  that 

code 

y 

*  - * ’  V  distortion  y'  — »  *' 

Figure  5:  Topographic  vector  quantiser 

we  introduce  a  distortion  process  that  acts  to  corrupt  the  code  y  to  produce  a  distorted 
code  y'.  We  choose  our  notation  carefully  to  be  a  vectorised  form  of  the  notation  that  we 
used  in  earlier  sections  of  this  paper. 


distorted  code 


o 


reconstruction 
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5.2  Robust  Euclidean  distortion 


Equation  (391  becomes 

Ds  =  jdxP(x)  j  dn  tt(„)  |jz  -  x'(y(x)  +  n)||2  (48) 

where  we  now  average  over  all  inputs  z  and  distortions  n.  tr(n)  is  the  probability  density 
over  distortions,  where  we  have  assumed  that  n  is  additive  and  statistically  independent 
of  z.  If  we  compare  equation  (39)  with  equation  (26),  we  observe  that  n (n)  describes  a 
multidimensional  (in  code  space)  version  of  the  Toeplitz  matrix  distortion  process  that  we 
used  in  the  scalar  case. 


By  analogy  with  equation  (41)  we  can  derive 


D  5 


(y)=(*)z  iy 


J  dy  j  dx  P( z )  tt( y  -  y(z))  jiz  -  x'(y)!  2 

J  dyP(y)  ^  p! z  -  (*)*  y  ^ 


(49) 


Note  that  the  £(y  -  y(z))  in  equation  (41)  has  simply  been  replaced  by  7r(y  -  y(z)) 
in  equation  (49).  The  angle  brackets  (•■•}a.y  still  denote  an  average  over  the  posterior 
probability  density  (of  z  given  y),  which  is 


P(*:y) 


y(y  -  y(z))P(*) 
P(y) 


( 50  i 


This  yields  the  correct  result  in  the  special  case  that  rr(y  -  y(x))  —  6(y  -  y(*)),  as  in 
equation  (41 ). 

It  is  important  to  note  that  the  CY  z'(y )  that  minimises  Dt,  is  still  the  mean  of  the 
posterior  probability  *'(y)  =  (*)x  y  In  fact,  the  only  difference  between  equation  (41) 
and  equation  (49)  is  the  replacement  £(y  -  y(z))  —  7r(y  -  y(*)). 


5.3  Topographic  code  vector  density 

The  distortion  process  that  is  modelled  by  7r(n)  describes  a  neighbourhood  function  that 
corresponds  to  a  Toeplitz  matrix  (as  in  equation  (24)),  so  the  distortion  volume  that  is 
associated  with  the  posterior  probability  P(z|y')  is  proportional  to  to  the  distortion  volume 
that  is  associated  with  P(z|y),  with  the  same  constant  of  proportionality  for  each  CV.  \Ve 
may  thus  replace  equation  (43)  by 

Dt,  cx  J  dx  P(z)p(z)-^  (51) 

Thus  the  same  optimum  CV  density  emerges  as  we  obtained  in  the  standard  VQ  derivation 
starting  at  equation  (42). 

In  this  sketch  derivation  of  the  asymptotic  CV  density  in  a  TVQ  we  have  assumed 
that  the  optimum  distortion  volume  (corresponding  to  P(x\y'))  is  characterised  by  a  single 
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length.  The  full  proof  of  the  TYQ  case  follows  a  similar  pattern  to  that  found  in  appendix  B 
for  the  VQ  case  6. 

5.4  Intuitive  interpretation  of  the  equivalence  between  topographic  and 
plain  vector  quantisers 

We  have  presented  a  lot  of  mathematics  in  our  various  derivations  of  asymptotic  CV  densi¬ 
ties.  In  all  cases  the  result  has  been  unmodified  by  neighbourhood  functions  provided  that 
we  consistently  take  their  effect  into  account  during  the  encoding/decoding  processes  (ie 
use  minimum  distortion  encoding  rather  than  nearest  neighbour  encoding). 

We  wish  to  present  an  intuitive  picture  of  the  processes  at  work  that  conspire  to  make 
the  standard  YQ  and  the  TYQ  equivalent,  in  the  sense  of  code  vector  densities.  In  figure  6 


(a)  Plain  vector  quantiser  (b)  Topographic  vector  quantiser 


Figure  6:  Equivalence  of  standard  and  topographic  vector  quantisers 


we  show  how  the  equivalence  emerges  when  the  input  x  is  2-dimensional  and  the  code  y 
has  a  1-dimensional  topology  (which  can  be  turned  on  and  off  at  will).  For  the  purposes 
of  this  argument  we  shall  assume  that  P(x)  is  uniform.  In  figure  6(a)  we  idealise  the  code 
vectors  (and  their  cells)  of  a  YQ  as  sitting  on  a  square  lattice7.  In  figure  6(b)  we  show- 
how  these  same  code  vectors  (and  their  cells)  would  be  modified  when  a  1-dimensional 
neighbourhood  function  is  introduced.  In  our  example  we  use  a  neighbourhood  that  ranges 
over  the  [— l,  +  l]  neighbourhood  about  each  code  vector  index. 

The  net  effect  of  the  neighbourhood  function  is  to  attract  neighbouring  code  vectors 
together,  and  simultaneously  to  repel  more  distantly  separated  pieces  of  the  1-dimensional 
“line”  of  code  vectors.  The  overall  effect  is  to  preserve  the  density  of  code  vectors  that 
would  have  been  found  in  a  standard  VQ  (zero  neighbourhood  size),  and  to  ensure  that 
the  posterior  covariance  P(x\y')  is  isotropic.  In  our  example  there  is  a  compression  by 
a  factor  of  \/Z  in  the  horizontal  direction,  and  a  expansion  in  the  vertical  direction  by  a 
factor  of  \/3,  thus  guaranteeing  that  the  area  associated  with  each  code  vector  remains 

'We  leave  the  details  as  an  exercise  for  the  reader!  Although,  before  attempting  to  derive  the  result  it 
should  prove  useful  to  read  the  intuitive  arguments  in  {$5.4. 

rIn  fact,  this  arrangement  does  not  minimise  Dt ,  but  it  will  serve  to  illustrate  our  argument. 
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unchanged  (ie  density  is  unchanged).  Furthermore,  P(zjy')  is  represented  by  the  bold 
square  superimposed  on  each  diagram:  in  both  cases  it  is  isotropic  8.  Note  how  the  posterior 
covariance  includes  3  quantisation  cells  in  the  TVQ,  due  to  the  [-l,+l]  neighbourhood 
function. 

This  intuitive  picture  is  very  convenient,  because  we  can  imagine  that  the  sole  effect 
of  introducing  a  topology  in  the  code  book  is  to  attract  neighbouring  code  vectors  whilst 
effectively  repelling  more  distant  code  vectors  (due  to  conservation  of  the  total  number  of 
code  vectors).  In  the  case  of  an  d-dimensional  neighbourhood  function,  the  code  vectors 
would  attract  themselves  into  d-dimensional  sheets  embedded  in  the  input  space.  This  sheet 
would  repeatedly  fold  over  on  itself  (like  puff  pastry)  to  fill  the  input  space.  However,  the 
sheet  would  not  collaps  onto  itself  because  there  is  an  effective  repulsion  between  different 
folds  of  the  sheet.  Overall,  the  net  density  of  code  vectors  is  the  same  as  would  have  been 
encountered  in  a  standard  VQ. 

Remember  that  nearest  neighbour  encoding  will  not  lead  to  this  convenient  result:  you 
have  to  use  minimum  distortion  encoding  which  takes  into  account  the  effect  of  the  process 
that  distorts  the  code  indices. 


6  Numerical  simulation 


In  this  section  we  shall  present  the  results  of  a  simple  numerical  simulation  that  verifies  our 
theoretical  prediction  p(x)  ot  P(x)1'3  in  the  1-dimensional  case. 


0.1  Numerical  experimental  procedure 

We  shall  now  describe  the  numerical  experiment  that  was  performed  in  [7] . 

1.  Define  a  finite  support  for  x:  x  £  [0, 1  ]. 

2.  Define  a  probability  density  P(z)  of  input  scalars:  P(x)  -  2x  (ie  a  ramp). 

3.  Choose  the  number  nn.  of  CVs  (in  this  case,  code  scalars)  that  you  wish  to  use.  In  [7) 
n n  =  100  was  used,  but  we  shall  use  n„  =  30  because  we  find  that  the  results  still 
approximate  those  that  we  would  expect  in  the  asymptotic  (ie  n„,  — *  oc)  case. 

4.  Choose  the  number  n  that  determines  the  size  of  the  [-n,  -fin]  topographic  neighbour¬ 
hood,  in  the  form  given  in  equation  (12). 

5.  Adapt  the  positions  of  the  CVs  using  the  standard  training  scheme  for  TMs  [4].  We 
use  a  simplified  version  of  the  method  in  [7]  where  we  use  an  update  step  size  e  =  0.1 
(in  equation  (54)),  and  we  train  for  500000  presentations  of  x  samples  chosen  randomly 
from  P(x)  9.  Combining  the  c  and  ’!‘v,y(r)  factors  in  equation  (54),  we  use  an  overall 

*At  least,  it  if  as  isotropic  as  the  lattice  model  that  we  have  introduced  will  allow! 

’It  is  not  clear  that  this  guarantees  complete  convergence,  but  the  training  schedule  is  good  enough  to 
demonstrate  the  point  that  we  wish  to  make. 
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Asymptotic  code  vector  density 


update  scheme 


(ntu1) 

v 


r'J-0‘d)  +  0.1  (/  -  z'v(old]) 


ly  -  y(*)l  <  n 


(52) 


6.  Break  up  the  10,  l]  interval  into  histogram  bins,  each  of  which  covers  a  small  interval 
[b  -  A/2 ,b  +  A/2]  of  width  A  centred  at  x  -  b.  As  in  [7]  we  choose  to  use  10  bins, 
so  A  =  0.1  and  the  6  are  drawn  from  the  set  {0.05, 0.15, ...,  0.85, 0.95}.  These  bins 
are  used  to  estimate  the  relative  frequency  with  which  the  CVs  land  in  each  of  the  10 
intervals. 

In  our  experiments  we  modified  the  procedure  that  was  used  in  [7]  by  incrementing 
the  bins  after  every  n„  (=  30  in  our  experiments)  training  samples.  Each  bin  thus 
contains  a  cumulative  count  of  the  number  of  CVs  that  have  appeared  in  it  (summed 
over  all  of  the  “snapshots"  taken  at  intervals  of  nn  samples). 

This  procedure  for  incrementing  the  histogram  has  an  infinitely  long  memory  time,  so 
the  final  histogram  (after  500000  samples)  will  be  the  mixture  of  all  histograms  that 
occurred  as  training  progressed  towards  convergence.  This  is  obviously  undesirable, 
and  could  be  cured  by  imposing  a  finite  memory  by,  for  instance,  making  the  histogram 
bins  "leaky-'.  We  do  not  implement  such  refinements. 

7.  Do  a  least  squares  fit  of  p(  x 1  versus  P{x)  as  follows: 

(a)  For  histogram  bin  b  -  A  '2.6  -  A  2  determine  two  quantities: 

i.  The  probability  P,  that  P\x)  generates  a  point  lying  in  bin  i:  this  can  be 
calculated  to  be  26,  A. 

ii.  The  probability  p,  that  a  C’V  lies  in  bin  t:  this  is  estimated  from  the  outcome 
of  the  numerical  experiment  as  the  proportion  of  CVs  that  land  in  bin  i. 

(b)  Plot  the  ( P,,p,)  coordinates  from  the  previous  step  on  a  (P,p)  graph  (P  is  the 
abscissa,  and  p  is  the  ordinate). 

( c )  Define  a  paramet  ric  model  p(  P )  =  A  Pa . 

(d)  Find  the  .4  and  a  that  minimise  the  sum  of  squared  errors  V),(p(P,  -  p,)):. 
Because  we  use  a  rather  small  value  of  n„  we  find  that  edge  effects  (near  x  -  0 
and  i  =  l)  adversely  affect  the  number  of  counts  in  the  6  =  0.05  (left-most)  and 
6  =  0.95  (right-most)  histogram  bins.  We  therefore  discard  these  two  bins  and 
use  only  the  central  8  (out  of  10)  bins  to  estimate  A  and  a. 

In  the  fitting  process  we  do  not  do  a  least  squares  fit  to  a  logarithmic  plot, 
because  it  is  the  estimated  p,  (not  the  estimated  logp,)  that  has  approximately 
Gaussian  errors. 

8.  Repeat  the  experiment  many  times  in  order  to  obtain  many  estimates  for  A  and  a, 
so  that  an  improved  estimate  with  a  reduced  standard  error  can  be  deduced.  We  do 
not  implement  this  step. 

We  wrote  a  program  to  implement  this  numerical  experiment,  and  we  found  that  we 
could  reproduce  the  results  quoted  in  [7],  when  we  used  the  standard  TM  training  scheme 

[4]- 


S  P  Luttrell,  24th  May  1990 
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In  order  to  modify  the  experiment  so  that  it  conforms  to  our  TVQ  formulation  of  TMs, 
we  need  to  alter  the  neighbourhood  function  used  in  step  4  of  the  numerical  experiment. 
We  used  a  neighbourhood  of  the  type  shown  in  figure  3  with  just  two  components:  a  0,0 
neighbourhood  term,  plus  various  types  of  [— 1,-t-l]  neighbourhood  term.  Combining  the  e 
and  7rvy(r)  factors  in  equation  (54).  we  use  an  overall  update  sciieme 


/  (nfu‘) 

T 

»(*) 

/  (nfv) 
7 

v(*)±l 


x'1??  +  0.1 

id*) 

'{old) 


I  -  X 


i  +  e'(z  ~ 


'(old), 
»(«)  > 
I  (old) 

‘v(*)± 


t) 


(53) 


where  we  choose  (’  to  be  one  of  {0.025,0.050,0.075}. 

We  also  need  to  change  step  5  of  the  numerical  experiment.  In  this  step  we  should  use 
minimum  distortion  encoding  rather  than  nearest  neighbour  encoding.  In  fact,  we  perform 
experiments  using  both  types  of  encoding  scheme  in  order  to  compare  the  two  outcomes. 

Apart  from  these  two  changes,  the  TVQ  experimental  procedure  is  identical  to  the 
uniform  neighbourhood  function  experimental  procedure  described  above. 


6.2  Numerical  experimental  results 


The  h,:. a!  estimated  values  of  o  lie  after  500000  updates)  are  shown  in  table  1. 


t 

TM 

TVQ 

0.025 

0.51 

0.32  j| 

i  0.0  50 

0.51 

0.31 

0.075 

0.52 

0.35 

Table  I:  Table  of  asymptotic  power  law  a  for  various  simple  neighbourhood  functions.  Both 
the  topogtaphic  mapping  case  and  the  topographic  vector  quantiser  case  are  shown. 


These  are  the  results  of  a  single  run,  so  we  have  not  attempted  to  quote  a  standard 
error.  Note  how  the  TM  ca'-e  consistently  produces  an  a  which  is  much  larger  than  the 
q  —  13  that  a  standard  YQ  would  produce,  but  the  TVQ  case  produces  a  result  which  is 
the  same  as  a  standard  VQ,  If  our  asymptotic  theory  is  correct,  then  the  slight  differences 
(between  1/3  in  the  VO  case  and  the  0.31-0.35  observed  in  the  TVQ  case)  are  because 
next  to  leading  order  ,.  erections  contribute  slightly  for  the  =  30  case  that  we  simulate. 
There  are  other  possible  explanations  for  these  slight  differences,  such  as  the  crudity  of  our 
process  for  estimating  a. 

We  show  in  figure  7  a  plot  of  the  estimates  for  o  as  a  function  of  (the  logarithm  to  base  10 
of}  the  number  of  training  steps.  These  plots  fall  into  two  clearly  separated  categories  which 
asymptotically  approach  (as  the  number  of  training  steps  increases)  the  results  tabulated 
in  table  1. 

Even  for  nn  =  30  we  can  easily  produce  a  result  for  a  that  is  remarkably  close  to  the 
theoretically  predicted  asymptotic  (ie  n„  — *  oo)  result.  This  is  very  encouraging  because  we 
may  use  our  asymptotic  result  with  confidence  to  predict  the  density  of  CVs  as  a  function 
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Asymptotic  code  vector  density 


1  igure  7:  Plot  of  o  versa*  the  number  of  training  steps  for  both  the  TM  and  the  TVQ  cases. 
I  he  plots  are  coded  as  follows:  ('  =  0.025  (solid),  t'  =  0.050  (dashes).  ('  =  0.075  (dots) 

of  the  input  probability  density,  apart  from  some  possible  edge  effects  (that  we  artificially 
removed  from  our  numerical  simulation). 


7  Conclusions  and  discussion 


The  asymptotic  code  vector  density  observed  in  topographic  mappings  depends  on  whether 
we  use  the  method  advocated  in  (4j  (ie  nearest  neighbour  encoding),  or  our  own  method  (ie 
minimum  Lj  distortion  encoding).  Our  own  method  of  optimising  topographic  mappings 
has  the  advantage  of  having  a  simple  vector  quantiser  interpretation,  and  it  leads  to  an 
asymptotic  density  which  is  the  same  (in  leading  order)  as  the  plain  vector  quantiser  result. 

However,  one  has  to  be  careful  to  avoid  certain  instabilities  that  can  occur  when  op¬ 
timising  topographic  vector  quantisers.  For  instance,  a  uniform  [-n,+n]  neighbourhood 
function  leads  to  metastable  solutions  in  which  the  code  vectors  tend  to  collapse  into  clus¬ 
ters  of  2n  +  1  code  vectors.  This  does  not  adversely  affect  the  distortion,  which  is  the  same 


21 


as  would  have  been  obtained  had  the  clusters  been  smoothed  out.  However,  we  find  that  for 
monotonically  decreasing  (as  a  function  of  radius)  neighbourhood  functions  this  clustering 
instability  does  not  occur,  although  for  slowly  decreasing  neighbourhood  functions  one  has 
to  perform  small  update  steps  during  training  in  order  to  avoid  clustering  problems. 

Our  derivation  of  the  asymptotic  code  vector  density  in  N  dimensions  leaves  something 
to  be  desired,  and  it  should  be  regarded  as  no  more  than  a  suggestive  model  of  what  might 
actually  occur  in  N  dimensions. 

A  possible  application  of  our  result  is  to  estimate  the  total  input  probability  associated 
with  each  code  vector.  The  simple  relationship  between  code  vector  density  and  input 
probability  density  makes  this  calculation  simple  to  perform.  More  generally,  it  should 
prove  to  be  theoretically  convenient  that  we  have  obtained  a  code  vector  density  that  is  the 
same  as  in  the  plain  vector  quantiser. 

Finally,  the  use  of  a  topographic  neighbourhood  function  disrupts  the  properties  of 
a  vector  quantiser  less  than  had  hitherto  been  thought,  provided  that  we  use  minimum 
distortion  (rather  than  nearest  neighbour)  encoding. 


Appendix  A  Density  in  1  dimension:  Ritter’s  theory 


In  this  appendix  we  shall  summarise  the  theory  of  the  density  of  code  vectors  in  a  scalar 
topographic  mapping  as  presented  in  We  have  deliberately  changed  the  notation  of  ’7j 
to  conform  to  our  ow:.  choice  of  notation,  but  have  otherwise  retained  the  theory  intact. 


A.l  Update  procedure 


The  derivation  in  '7  is  based  upon  the  training  procedure  in  [4),  which  is  given  as  a  prescrip¬ 
tion  for  updating  a  set  of  vectors  x'y.  We  shall  refer  to  these  x'y  vectors  as  code  vectors  in 
order  to  conform  with  our  own  vector  quantiser  version  of  topographic  mappings,  although 
the  update  scheme  in  (4]  does  not  have  a  simple  vector  quantiser  interpretation. 

The  update  scheme  is 


l(netr) 

V 


(54) 


where  is  a  neighbourhood  function.  If  *y,v(x)  =  ^»,»(*)  then  only  is  updated. 

If,  on  the  other  hand,  iry  „(*)  has  non-zero  off-diagonal  elements  then  other  (neighbouring) 
x'v  ( y  5 ^  y(x)}  are  updated  as  well.  The  theory  in  [7]  uses  the  same  7ry  v(je)  that  we  defined 
in  equation  (12),  which  specifies  that  equation  (54)  updates  those  code  vectors  x'y  whose 
index  y  lies  within  the  [-n,  +  n]  neighbourhood  of  y(z). 
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A. 2  Finite  differences  and  derivatives 


Ln  [7’  (and  in  [4])  it  is  assumed  that  the  qy  are  given  by 


9v  =  \  (*i  +  *i+i)  (55) 

which  is  the  same  as  the  qv  used  in  a  minimum  Li  distortion  vector  quantiser  (see  equa¬ 
tion  (5)),  but  is  different  from  the  qy  used  in  the  topographic  version  of  the  same  vector 
quantiser  (see  equation  (13))  (assuming  a  [  — n, +n]  neighbourhood).  It  is  the  difference 
between  equation  (55)  and  equation  (13)  that  prevents  equation  (54)  from  leading  to  a 
minimum  Li  distortion  when  a  non-zero  neighbourhood  size  is  used.. 


The  midpoint  and  half-length  analogous  to  our  equation  (15)  and  equation  (16)  are  then 
given  by 


—  2  (<?■-"  — 1  ■"  9tfn) 

__  1  ,  /  t  i  .  i  \ 

=  \  (a-i-n-l  *  -  2j()  -  i  (*!-„ 

=  r»  (n^  l)^n2d2z;  c  (n*d4< 


2z[ 


di 2 


diA 


(56) 


a,  - 


2  (?i~n  <?i-n-  1  J 


2n  —  1  dr  l 
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-  O  n 


1  t  ' 
3./ 


di3 


(57) 


A. 3  Update  equilibrium 


When  the  update  procedure  in  equation  (54)  has  converged  then  on  average  there  is  no  net 
tendency  for  tiny  of  the  code  vectors  to  move  to  the  right  or  the  left:  equilibrium  has  been 
attained.  This  may  be  expressed  as 

J dx  P(x)  trViV(x)(x  -  x’J  =  0  (58) 


which  we  may  rearrange  by  dividing  up  the  range  of  integration  over  z  into  intervals  j qv- 1 ,  qv) 
to  yield 


/  ’  dx  P(x )*•„,„<( z  -  x'v)  =  0 

V' 


(59) 


wluch  should  be  compared  with  the  derivative  in  equation  (9). 
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In  [7],  an  equation  for  x'y  is  then  obtained  that  is  identical  to  our  equation  (11),  except 
that  his  qy  are  specified  by  equation  (55)  rather  than  equation  (13) 

We  do  not  need  to  repeat  the  derivation  of  an  approximation  to  xy,  because  our  result 
in  equation  (21)  is  applicable,  provided  that  we  use  the  expressions  for  uy  and  av  given  in 
equation  (56)  and  equation  (57). 


A. 4  Code  vector  density 


Finally  in  [7],  an  equation  that  is  analogous  to  equation  (22)  is  obtained 


1  dp(x'y)  ^  g  dP{x'y) 
P(x'y)  dy  P(!rJ,)  dy 


(60) 


where  a  is  given  by 


=  1  (2n  4-  l)2 

3  {n  +  1  )2  +  n2 


(61) 


In  leading  order,  the  solution  to  the  first  order  differential  equation  in  equation  (60)  is 


PK)  *  />(*')“  (62) 

When  n  =  0  the  power  law  a  becomes  a  =  1/3,  and  in  the  limit  n  —  oc  the  power  law  a 
becomes  a  =  2  3.  Equation  (62) 


A. 5  Comparison  of  Kohonen/Ritter  method  with  Luttrell  method 

We  observe  that  the  use  of  the  standard  update  procedure  in  equation  (54)  based  on  nearest 
neighbour  coding  in  equation  (55)  leads  to  a  complicated  power  law  dependence,  whereas 
our  own  scheme  whereby  we  minimise  an  appropriately  chosen  L j  distortion  measure  leads 
to  a  power  law  dependence  that  is  the  same  as  that  observed  in  a  standard  vector  quantiser 
(with  zero  neighbourhood  size). 

The  price  to  pay  for  the  convenience  of  using  a  minimum  L 2  distortion  approach  (rather 
than  the  standard  approach  in  [4])  is  that  encoding  method  is  no  longer  nearest  neighbour, 
and  there  are  certain  stability  problems  that  can  arise  when  7rv»>v  is  chosen  inappropriately. 


Appendix  B  Code  vector  density:  full  derivation 


In  this  appendix  we  shall  present  a  more  extensive  derivation  of  the  optimum  code  vector 
density  that  does  not  assume  that  the  distortion  volume  is  characterised  by  a  single  radius. 
Thus  we  introduce  a  more  sophisticated  model  in  which  we  use  a  full  covariance  matrix  to 
model  the  possibly  anisotropic  shape  of  the  distortion  volume,  and  we  relate  this  covariance 
matrix  to  the  required  code  vector  density  in  order  to  make  contact  with  the  problem  that 
we  are  trying  to  solve. 
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B.l  Euclidean  distortion 

First  of  all  write  the  final  result  in  equation  (41)  as 

D6  =  JdyP(y)  trace a(y) 

=  J  dy  P(x)  trace  a(y(x))  (63) 

where  cr(y)  is  the  posterior  covariance  matrix  defined  as 

*(y)  EE  ((*  -  <*>*,„)  (*  -  (*)*iy)7'}z|y  (64) 

where  the  superscript  “T”  denotes  “transpose”. 


B.2  Correspondence  between  code  vector  density  and  posterior  covari¬ 
ance 

By  comparing  equation  (63)  with  equation  (43)  we  may  identify  the  correspondence 

p(jr)“£  —  trace<7(y(;e))  (65) 

We  shali  use  this  relationship  to  determine  the  effective  p{z)  that  corresponds  to  any  par¬ 
ticular  choice  of  cr(y(z)). 


B.3  Functional  derivative  of  Euclidean  distortion 


We  need  to  minimise  De  with  respect  to  the  elements  of  the  matrix  cr(y),  subject  to  the 
condition  that  the  total  volume  associated  with  cr(y)  (for  all  y)  is  held  constant.  Thus 
introduce  C  by  analogy  with  equation  (44)  as 

C  =  J  dz\  det  cr(y( a:))’- 5  (66) 

and  by  analogy  with  equation  (45)  we  minimise  D'6  =  D 6  +  AC  with  respect  to  the  elements 
of  the  matrix  cr(y). 


In  order  to  functionally  differentiate  with  respect  to  <7,;(£)  we  need  the  following 
two  results 


f  trace  <7(y) 

6°.M) 

6  [det  <r(y)]~? 


=  tijf(y-t) 

=  -;[det<7(y)]-5  [<r(y)]"1i(y  -  {) 


(67) 


(68) 
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By  inserting  the  results  in  equation  (67)  and  equation  (68)  into  6D'6/So,j(y)  we  obtain 

i^(y)  =  J dxP(*)£as(y(*)  -  y)  ~  \Jdx  Hy)}^ s(y(x)  -  y)[det^(y)j'’ 

=  J  dx£{y(x)  -  y)  |p(*)^  -  ^  [det<x(y)]~5  [^(vjjj;1  (69) 

B.4  Minimum  Euclidean  distortion 

We  must  now  solve  the  stationarity  condition  6D'6/6<Tij( y)  =  0  to  find  the  minimum  dis¬ 
tortion  choice  of  cq,(y).  If  we  ignore  the  assumed  small  variation  of  P(x)  as  *  ranges 
over  inputs  that  map  to  y  =  y(*)  (this  approximation  is  valid  in  leading  order),  then  the 
stationarity  condition  reduces  to  requiring  that  the  integrand  of  equation  (69)  be  zero.  Thus 

^  [det  cr( v(* )}■“ ^  [cr( y/( a: ) )] J,1  (70) 

which  implies  that 

0tAy)  ~  ("1) 

where  the  o0(y)  on  the  right  hand  side  is  a  scalar.  We  have  thus  deduced  that  (for  each 
code  vector)  the  optimum  posterior  covariance  matrix  is  characterised  by  a  single  parameter 
cr0(y)  corresponding  to  the  (squared)  radius  of  the  corresponding  distortion  volume. 

B.5  Code  vector  density 

Combining  equation  (70)  and  equation  (71)  yields 

P(x)  oc  '<ro(y(*))i_7  [<7o(y(*))'_1 

*  i^otyt*)))-^  (72) 

which  we  may  invert  to  obtain  the  optimum  ao(y(*))  in  terms  of  the  input  probability 
density  P{x ) 

oro(y(*))oc  P{x)-ffe  (73) 

and  using  the  correspondence  in  equation  (65)  we  may  also  obtain  p{x)  in  terms  of  P(x)  as 

p(x)  oc  P(ce)*+ T  (74) 

which  is  the  required  result. 
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